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ABSTRACT 

Nuclear lamins contact the genome at the nuclear 
periphery through large domains and are involved 
in chromatin organization. Among broad peak call- 
ing algorithms available to date, none are suited for 
mapping lamin-genome interactions genome wide. 
We disclose a novel algorithm, enriched domain de- 
tector (EDD), for analysis of broad enrichment do- 
mains from chromatin immunoprecipitation (ChlP)- 
seq data. EDD enables discovery of genomic do- 
mains interacting with broadly distributed proteins, 
such as A- and B-type lamins affinity isolated by 
ChlP. The advantages of EDD over existing broad 
peak callers are sensitivity to domain width rather 
than enrichment strength at a particular site, and ro- 
bustness against local variations. 

INTRODUCTION 

The eukaryotic nucleus is bounded by the nuclear en- 
velope. The nuclear envelope consists of double mem- 
brane and, interfacing the inner membrane and chromatin, 
a meshwork of filamentous proteins called lamins (1). 
Lamins are involved in the regulation of many nuclear func- 
tions including chromatin organization (1,2). Mutations in 
lamin A (LMNA) cause diseases commonly referred to as 
laminopathies, which include partial lipodystrophies, myo- 
dystrophies or premature aging (3,4). Moreover, variations 
in B-type lamin level and distribution (in particular lamin 
Bl; LMNB1) have been associated with aging and senes- 
cence (5-8). A- and B-type lamins interact with chromatin 
through lamina-associated domains or LADs, of typically 
0.1 to 10 megabases (Mb) (9-13). LADs have initially been 
identified using DamID, an assay relying on the tagging of 
DNA sequences in proximity to nuclear lamins, and iden- 
tification of these sequences (2,9). Important features of 
LADs are their gene-poor content, the repressed state of 



genes within them, and their enrichment in heterochromatin 
(2,12,14). 

LADs have also been evidenced by chromatin immuno- 
precipitation (ChlP) of LMNA followed by array hybridiza- 
tion (15-17) and by ChlP of LMNB1 followed by high- 
throughput sequencing (ChlP-seq) (6,7). Lamins tend to be 
widely distributed on chromosomes, with regions of low oc- 
cupancy (6,7,9,11,12,16). Therefore, lamin ChlP-seq data 
differ in distribution and signal-to-noise ratio from more 
'conventional' ChlP-seq data for, for instance, focused hi- 
stone post-translational modifications (hPTMs) or tran- 
scription factors (TFs), which show narrow and strong en- 
richment (18,19). Broad and low-level enrichment cannot 
be detected by ChlP-seq peak callers, such as MACS which 
are designed to detect hPTMs or TFs in narrow windows 
(20). 

Several algorithms have been designed to detect broader 
peaks of enrichment. These include SICER, a clustering 
approach for domain identification (21); HPeak (22) and 
RSEG (23), two hidden Markov Model-based programs; 
PeakRanger (in particular the CCAT algorithm), detect- 
ing broad regions and summits within (24,25); and Broad- 
Peak which identifies wide peaks over a low-level profile 
(26). These programs are designed to discover regions of 
hPTM enrichment wider than peaks of TF binding; how- 
ever these regions are narrower than the megabase-size do- 
mains interacting with lamins (2), questioning the appli- 
cability of these algorithms to the detection of LADs. In 
addition, BroadPeak lacks support for 'input' chromatin 
sequences (26), i.e. sequences from fragmented chromatin 
not enriched in any specific protein by immunoprecipita- 
tion (unlike the ChlP sample) and commonly used as ref- 
erence against ChlP samples in the analysis. This makes 
BroadPeak unsuitable for analysis of ChlP-seq data that do 
not display a prominent difference between actual enrich- 
ment and background. SICER and PeakRanger detect pu- 
tative peaks based on the ChlP data alone, and only later in 
the analysis do they incorporate input data to evaluate the 
significance of the putative peaks (21,24). RSEG segments 
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the genome into foreground and background domains by 
identifying boundaries with significant transition probabil- 
ities, without taking the actual enrichment level in fore- 
ground domains into account (23). As lamin domains iden- 
tified by RSEG have very large genome coverage, with many 
domains displaying very low enrichment levels, we found 
that RSEG is too lenient in a lamin context (see below). 
These limitations may in practice be irrelevant when analyz- 
ing hTPM domains or similar ChlP-seq data; however they 
constitute a major hindrance in the analysis of ChlP-seq 
data for lamins and other broadly distributed chromatin- 
bound proteins. 

To alleviate these limitations, we developed an algorithm 
called enriched domain detector (EDD). We benchmark 
EDD against other broad peak callers using published 
lamin ChlP-seq data. We show that EDD enables quan- 
titative analysis of ChlP-seq data for proteins widely dis- 
tributed and with low-level enrichment on chromatin. We 
also demonstrate that EDD can discover genomic domains 
enriched in LMNA using new ChlP-seq data for LMNA. 
The main advantage of EDD over other peak callers is sen- 
sitivity to the width of enriched domains rather than enrich- 
ment strength at a particular site, and robustness against lo- 
cal variations. 



MATERIALS AND METHODS 

Cells 

Human normal dermal fibroblasts (Lonza CC-251 1; LDFs) 
and human normal primary dermal fibroblasts (Norwegian 
Stem Cell Center AD04DFs) were cultured in DMEM/F12 
with 13% FCS, 2 ng/ml basic fibroblast growth factor and 
antibiotics. Cells were exponentially growing and harvested 
at conffuency, at passage 5-7. AD04DFs were obtained with 
Norwegian Ethics Committee Approval REK2617A. 



Lamin A ChlP-seq 

Cells (10 7 per ChIP) were cross-linked in suspension for 10 
min in PBS containing 1% formaldehyde before quenching 
with 1 .25 mM glycine. Cells were lysed for 30 min at 4°C on 
a rotator in RIPA buffer (140 mM NaCl, 10 mM Tris-HCl, 
pH 8.0, 1 mM EDTA, 0.5 mM EGTA, 1% Triton X-100, 
0.1% SDS, 0.1% Na-deoxycholate, 1 mM PMSF, lx pro- 
tease inhibitor cocktail) adjusted to 1% SDS, and sonicated 
for 3x 15 min in a Bioruptor (Diagenode; 30 s on/off at 
high power) to generate chromatin fragments of ^200-400 
base pairs (bp). After sedimentation, chromatin was diluted 
10-fold in RIPA without SDS, and incubated on a rotator 
overnight at 4°C with 50 |xg lamin A/C antibody (Santa 
Cruz sc-7292) pre-coupled to magnetic Dynabeads Protein 
G (16) (Invitrogen). Irrelevant mouse IgGs were used as 
control. ChIP material was collected and washed 3x in 1 
ml ice-cold RIPA buffer. Crosslink was reversed and DNA 
eluted for 6 h on a shaker at 37° C in elution buffer (50 mM 
NaCl, 20 mM Tris-HCl, pH 7.5, 5 mM EDTA, 1% SDS) 
containing 0.5 |xg/ml RNase A and 2 |xg/ml Proteinase K. 
DNA was purified (16), the library was prepared (Illumina) 
and sequenced on an Illumina HiSeq2500. 



ChlP-seq data processing 

The following pipeline was used for analysis of all 
LMNA and LMNB1 data sets. Reads were aligned to 
the HG19 reference genome using Bowtie2 v2.1.0 with 
default parameters (27). Duplicate reads were removed 
using Picard's MarkDuplicates program with parameter 
REMOVE.DUPLICATES set to true (keeping duplicate 
reads does not significantly affect LAD detection be- 
cause of the large size of LADs; Supplementary Ta- 
ble SI). To avoid any normalization bias, we ensured 
that each pair of aligned input and ChIP read files had 
the same read depth, by using Picard's DownsampleSam 
program vl.86 (www.broadinstitute.Org/gatk//events/2038/ 
GATKwhO-BP-l-Map_and_Dedup.pdf) on the larger of 
the two files. 

Peak calling 

We used auto-estimated parameters, when possible, for all 
the peak callers considered. For parameters that had to be 
set manually, we scripted the peak calling process for each 
individual peak caller; this allowed testing a range of possi- 
ble values for the analysis. We then inspected the results in 
a genome browser after an initial screening process where 
we removed clearly suboptimal results (e.g. no coverage de- 
tected, or peak coverage close whole genome coverage). 

PeakRanger vl.17. We used the CCAT algorithm, de- 
signed for detection of broad peaks with a window size of 
500 bp and a step size of 50 bp. 

SICER vl.l. We used a window size of 200 bp and allowed 
gaps up to 600 bp. The LMNA reads are 51 bp and effective 
genome size was computed to 0.77; LMNB1 reads are 36 bp 
and effective genome size was computed to 0.72. Genome 
sizes were in both cases computed according to SICER's 
instructions. Fragment size was set to 300 bp and false dis- 
covery rate (FDR) cutoff to 0.1. 

BroadPeak. BroadPeak expects a source file with ChIP 
read counts per bin and does not directly support input read 
counts. A workaround has however been proposed (26); 
it consists in subtracting the number of input reads from 
the number of ChIP reads in bins with more ChIP reads 
than input reads, and setting the read count to 0 in the 
other bins. We tested both approaches (i.e. considering only 
ChIP reads, or ChIP reads with input read subtraction), and 
found that subtracting input reads gave the most convinc- 
ing results with our data. We were unable to run Broadpeak 
with other window sizes than the default 200 bp. 

RSEG vO.4.8. We used RSEG's deadzones program to 
find non-alignable regions in HG19 for both 36 bp (LMNB1 
ChIP) and 51 bp (LMNA ChIP) read lengths. We then used 
the rseg-diff program in mode 2 (for ChIP versus input) to 
analyze each pair of ChIP and input files with the appropri- 
ate deadzones file. 

EDD vl.0. LMNB1 data were analyzed using a 3 Kb bin 
size and a gap penalty of 12. LMNA data were analyzed 
using an 1 1 Kb bin size and a gap penalty of 5 for LDF 
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and AD04DF_repl and 4 for AD04DF_rep2. Confidence 
intervals for p (see the Results section) were determined us- 
ing the normal approximation method for binomial propor- 
tions. 



Lamin A ChlP-qPCR 

Purified LMNA ChIP DNA was eluted in 30 |xl H 2 0 and 
2.5 fxl used as template for quantitative polymerase chain 
reaction (qPCR) with primers listed in Supplementary Ta- 
ble S2. PCRs were run on a MyiQ Real-time machine with 
SYBR® Green (BioRad) in duplicates, with 95°C for 3 min 
and 40 cycles of 95°C for 30 s, 60°C for 30 s and 72°C for 
30 s. 



RNA-seq 

Total RNA was isolated using the Ambion TRIzol® 
Reagent RNA extraction kit (Life Technologies). A library 
was prepared (Illumina) and sequenced on an Illumina 
HiSeq2500. Reads (29.3 x 10 6 ) were aligned using Cufflinks 
and TopHat (28) with default parameters. 

Data viewing 

Browser views of gene tracks, ChlP-seq data and peaks 
are shown using Integrated Genomics Viewer (IGV; 
broadinstitute.org/igv) (29). Unless otherwise indicated, 
genes considered in the analyses are from the Illumina 
iGenomes gene annotation with UCSC data source 
for HG19 (https://support.illumina.com/sequencing/ 
sequencing_software/igenome.ilmn). 

Published data sets analyzed 

LMNB1 ChlP-seq and corresponding input sequence data 
(6) were downloaded from NCBI GEO accession number 
GSE49341. 



Data access 

Our LMNA ChlP-seq data are available under GEO acces- 
sion number GSE54334. 



RESULTS 

Seeking to identify megabase-size chromatin domains: devel- 
opment of EDD 

To palliate the current aforementioned limitations, we de- 
veloped EDD, an algorithm aimed to discover, from ChlP- 
seq data, megabase-size domains in a putatively 'noisy' en- 
vironment (Figure 1). EDD has been released as a Python 
package and is installable from Python Package Index 
(https://pypi.python.org). EDD source code and manual 
are freely available at http://github.com/CollasLab/edd. 

EDD aims to identify slight but significant enrichment 
over broad genomic regions. EDD is optimized for ChlP- 
seq analysis of proteins localized at the nuclear periphery, 
such as lamins. These proteins are known to be associ- 
ated with heterochromatin (30). Thus, EDD is distinct from 
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Figure 1. Work-flow diagram of EDD. 

any other broad peak callers in that it is designed for dis- 
covery of enriched domains in mainly, but not exclusively, 
heterochromatin regions. This is important to consider be- 
cause input chromatin samples display a non-uniform dis- 
tribution of reads across the genome (31,32), depending on 
the extent of chromatin compaction (compact chromatin is 
more difficult to reverse-crosslink and results in fewer se- 
quence reads). Therefore, a high number of ChIP reads does 
not per se provide an indication of enrichment; rather, the 
relationship between ChIP and input read counts is criti- 
cal. Moreover, sequenced reads provide only a sample of 
the whole information in the cell population. Thus, reads 
need to be aggregated into bins to provide sufficient approx- 
imation of the ChlP-to-input relationship. Accuracy of this 
relationship will increase with increasing numbers of reads, 
and thus with increasing bin size. EDD ensures that these 
properties are met, also in a heterochromatin context. An 
additional critical step in EDD is the identification of clus- 
ters of enriched bins: this is because EDD aims to identify 
large genomic compartments predicted from current knowl- 
edge of interactions of nuclear envelope proteins with the 
genome (2,12,14). 

Aggregating reads in genomic bins. The initialization step 
of the EDD algorithm is to bin the genome and count the 
actual number of ChIP and corresponding input sequence 
reads in each bin. EDD then calculates for each bin the sam- 
ple ratio p: 

p = Number of ChIP reads/ 

(Number of ChIP reads + Number of input reads). 

As described however, the ChIP and input- sequenced 
reads provide only a sample of the total information in the 
chromatin fraction of the cell population examined. Thus, 
the true ChlP/input signal in a bin, p, is unknown. We aim 
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to use p as an estimate of p, but only when p is deemed 
to be a reasonable estimate. To determine this, we compute 
the 95% confidence interval for p using the Agresti-Coull 
method (33). We observed that extreme values of p (i.e. p 
very close to 0 or 1) are almost exclusive to bins with few 
reads and a large confidence interval. We found downstream 
analysis to be more robust against noise if we ensured, by in- 
creasing bin size, that p was a good estimate for p in most 
bins: we only use p as an estimate of p in bins where the 
confidence interval is below a threshold, by default 0.25; we 
refer to this subset of bins as 'informative bins'. EDD se- 
lects the smallest bin size that generates at least 99% infor- 
mative bins, excluding bins without reads. As a result, three 
bin classes are generated: non-informative bins (NIBs), en- 
riched informative bins with p > 0.5 (EIBs; i.e. bins en- 
riched in lamin) and depleted informative bins with p < 0.5 
(DIBs; i.e. bins depleted of lamin). 

Bin scoring. Bins must be scored prior to searching for 
putative peaks. We seek to assign EIBs a positive score 
weighted on /?, assign DIBs a negative score weighted on 
p, and assign NIBs a weak negative score. Further, we seek 
to give a strong deterrent to prevent DIBs with p close to 0 
to be included in a peak, and similarly give EIBs with a p 
close to 1 a strong encouragement. The logit function: 

logitO) = log(/?) - log(l - p) 

meets these properties. 

We must additionally select a gap penalty that influ- 
ences the cost of peaks spanning DIBs and NIBs. A relative 
weak gap penalty will often result in the detection of very 
large domains, missing potentially interesting fluctuations 
within domains. Conversely, a disproportionally strong gap 
penalty will miss many domains with a slight heterogeneity. 
The choice of gap penalty depends on both the data ana- 
lyzed and the interests of the researcher. The bin scoring 
function is thus defined as: 

bin scoreO?) = logitO?), if p > 0.5 
bin scoreO?) = logitO?) * G, otherwise ' 

where G is the gap penalty. 

If the gap penalty is not specified, then EDD will choose 
the gap penalty that optimizes the function: 

EPR 5 * ECR, 

where EPR (enriched peak ratio) is the ratio of EIBs in 
peaks and ECR (enriched coverage ratio) is the ratio of EIBs 
in the whole genome that are covered by peaks. This is the 
function that best predicts the manually selected gap penal- 
ties for the data sets we have analyzed while developing and 
testing EDD. It is therefore important to inspect the results 
and potentially manually modify the gap penalty parame- 
ter for optimal results (we refer to the EDD manual online 
for additional information). Lastly, one must decide how to 
score the NIB bins, that is, the bins with too few reads to 
score based on p. As we have poor knowledge of their ac- 
tual enrichment level, we conservatively set their score to 
the median DIB score. 

Detection of clusters. EDD aims to detect significant clus- 
ters of EIBs to identify peaks. However as bin classification 



is imperfect, merely searching for contiguous EIB regions 
is too restrictive. We use a linear time algorithm for find- 
ing all maximal scoring subsequences (MSSs) (34); this al- 
gorithm is also used by BroadPeak. Given a sequence of 
real numbers (bin scores), the MSS algorithm finds the non- 
overlapping contiguous subsequences with the greatest to- 
tal scores. The emitted subsequences are potential peaks 
with a 'peak score' equal to the sum of bin scores within 
the subsequence. 

Significance testing. EDD seeks to identify EIB clusters 
that are highly unlikely to occur by chance; it relies on a 
Monte Carlo simulation where, for each trial, the order of 
the bins is shuffled throughout the genome, bin scores are 
kept constant and the score of the highest scoring maxi- 
mal subsequence is recorded (Figure 1). Note that EDD 
requires a list of unalignable regions for the organism an- 
alyzed, such as centromeres and telomeres, that should not 
be shuffled. These regions are all NIBs and would, if shuf- 
fled as any other bin, incorrectly decrease the score of the 
MSS per trial. Additional information on unalignable re- 
gions is provided in the EDD manual online. 

We observe that the theoretically lowest possible result of 
a Monte Carlo trial is equal to the highest scoring bin. Thus, 
we discard all observed putative peaks with a value equal 
to or less than the highest scoring bin prior to significance 
testing. For the remaining putative peaks, we compute P- 
values as: 

P-value(s)=^ (7 ^ )+1 , 

N+ 1 

where s is the score of a potential peak, T t is the result of 
Monte Carlo trial i and N is the total number of trials per- 
formed. Lastly, we use an FDR procedure [35] to adjust the 
P- values for multiple testing and report peaks with an FDR 
value below a user-set threshold. 

Configuring EDD. EDD has both required (e.g. input 
files) and optional (e.g. gap penalty) run- time arguments. In 
addition, EDD reads other parameters from a user config- 
urable file. The default values for these parameters should 
be sensible for most uses, but there might be situations 
where additional fine-tuning is required. Parameters such as 
the required percentage of informative bins, the confidence 
interval limit and the method (35) used to compute the con- 
fidence interval can be adjusted here (see the EDD manual 
online for additional information). 

Benchmarking EDD against published LMNB1 ChlP-seq 
data in relation to other broad peak callers 

We benchmarked EDD against existing broad peak callers 
including BroadPeak (26), PeakRanger (24,25), SICER (21) 
and RSEG (23) on published triplicate LMNB1 ChlP-seq 
and matched input sequence data accessed from NCBI 
GEO GSE49341 (6). For simplicity in the description of our 
analysis, we refer to genomic regions discovered by EDD or 
other algorithms as 'peaks' even though they are large do- 
mains rather than narrow and sharp peaks. Total genome 
coverage under peaks detected by these algorithms varies, 
from -1050-1500 Mb (EDD and RSEG, respectively) to 
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and again SICER peaks are narrower than typical LADs. 
Genome coverage of SICER peaks is also very low (Ta- 
ble 1) and SICER detects very few unique peaks relative to 
EDD (Figure 2e). Thus, SICER is not suitable for LAD dis- 
covery. RSEG: RSEG is also designed to detect hPTM do- 
mains (23) and was used to detect LMNB1 LADs by Sadaie 
et al. (6). Most domains detected by EDD are also discov- 
ered by RSEG RSEG also detects many unique domains 
(Figure 2e), but these show low enrichment compared to 
all domains detected by EDD (Figure 2c) and to EDD- 
only domains (Figure 2d, middle). Some of the RSEG do- 
mains also show 'negative' enrichment (Figure 2b and d), 
so with this data set RSEG is unable to exclusively discern 
enriched domains. We conclude that EDD is able to consis- 
tently discover large genomic domains enriched in LMNB1 
and therefore fulfills its purpose for the detection of LADs. 
This conclusion is further supported by the consistency be- 
tween LADs discovered by EDD from the LMNB1 ChlP- 
seq data, and lamin B LADs identified by DamID in human 
fibroblasts (9) and in the human HT1080 fibrosarcoma cell 
line (11) (Figure 2a). 



Figure 2. Benchmarking of EDD and other broad peak callers on LMNB 1 
ChlP-seq data, (a) IGV browser views of LMNB1 occupancy and LADs 
on chromosome 10, detected by EDD, RSEG, BroadPeak, SICER and 
PeakRanger from LMNB1 ChlP-seq data from human fibroblasts (6), and 
by DamID in human fibroblasts (Guelen LADs) (9) and HT1080 human 
fibrosarcoma cells (Meuleman LADs) (11) (DamID LADs). Tracks also 
show genes and LMNB1 log ChlP/Input ratios, (b) Median LMNB1 peak 
length detected by indicated peak callers (P = 0.000. . . ; Wilcoxon rank- 
sum tests; see Supplementary Table S3 for W- and P- values), (c) Median 
LMNB1 enrichment within peaks detected by indicated peak callers, (d) 
Median LMNB 1 enrichment within peaks uniquely detected by EDD ver- 
sus BroadPeak, RSEG, or SICER. (e) Venn diagram analysis of genome 
coverage under EDD, RSEG and SICER peaks. ChlP-seq data shown are 
all from replicate 3 of the Sadaie LMNB1 data set (6); similar results were 
obtained for replicates 1 and 2 in this data set (not shown). 



-720 Mb (BroadPeak), -260 Mb (SICER) and 9-25 Mb 
(PeakRanger) (Table 1). Numbers of peaks detected also 
vary, with SICER, BroadPeak and PeakRanger detecting 
large and variable numbers of peaks between replicates (Ta- 
ble 2). EDD detects 1803-1890 peaks in the three repli- 
cates, making it the most consistent of the algorithms tested 
(Table 2). Variations between algorithms can be explained 
by their design to detect peaks narrower than LADs: in- 
deed, PeakRanger, BroadPeak, SICER and RSEG detect 
peaks significantly narrower than EDD peaks (P = 0.0000; 
Wilcoxon rank-sum test; Figure 2a and b). The low coverage 
and peak length detected by PeakRanger (Table 2; Figure 
2a and b) reveal its unsuitability for LAD discovery; thus it 
was not further tested in our study. 

We next characterized LMNB1 enrichment within peaks 
detected by BroadPeak, SICER and RSEG in compari- 
son to EDD. BroadPeak: Peaks reported by BroadPeak dis- 
play a wide range of enrichment levels (Figure 2c); however 
peaks unique to BroadPeak relative to EDD strikingly in- 
clude regions of 'negative' enrichment (log ChlP/Input < 
0; Figure 2d, left). This is in sharp contrast to EDD-specific 
peaks which all show positive enrichment (Figure 2d, left). 
Thus, EDD prevails over BroadPeak for the detection of 
LADs. SICER: SICER is aimed to detect hPTMs (21) 



EDD identifies, from ChlP-seq data, megabase domains as- 
sociated with LMNA 

We applied EDD to ChlP-seq data we generated for A-type 
lamins (LMNA) in two human normal primary dermal fi- 
broblast cultures (LDFs and AD04DFs). LMNA and as- 
sociated DNA was immunoprecipitated using antibodies to 
LMNA, which we have recently validated for ChIP (16). To 
evaluate the performance of EDD on LMNA peak discov- 
ery from these data sets, we also benchmarked EDD against 
BroadPeak, RSEG and SICER. 

We slightly modified our original LMNA ChIP proto- 
col (16) by substituting cell lysis and ChIP buffers with a 
more stringent RIPA buffer to improve lamin solubiliza- 
tion and consistency of chromatin fragmentation. We ob- 
tained 25 to > 40 million reads for each LMNA ChIP and 
input samples. To visualize LMNA profiles, we calculated 
ratios of ChlP/input reads in 10 Kb bins throughout the 
genome. In the IGV browser, LMNA ChlP/input log ra- 
tios reveal large domains of LMNA enrichment in mainly 
gene-poor regions, and areas depleted of LMNA (Figure 
3a). Our LMNA ChlP-seq data were validated by ChlP- 
qPCR experiments for several promoter and intergenic re- 
gions (Figure 3b). The results show sites of high and low 
LMNA occupancy, in line with their localization within or 
outside LADs identified later using EDD. 

We next applied EDD to identify LMNA enriched do- 
mains and again benchmarked it against BroadPeak, RSEG 
and SICER (Figure 4a and b). (i) We find that EDD dis- 
covers -360-540 peaks, or LMNA LADs (Figure 4a, EDD 
peaks; Table 2), which altogether cover —700 Mb (23%) of 
the genome (Table 1). A browser view shows that LADs 
detected by EDD are included within the LMNA LADs 
mapped by DamID in HT1080 cells (11) (Figure 4b). The 
DamID LADs also appear to cover a wider fraction of 
the genome than the ChlP-seq LADs (Figure 4b), consis- 
tent with earlier observations that DamlD-derived LMNA 
LADs represent —50% of the genome (11) (versus 23% 
with our ChlP-seq LADs). LADs discovered by EDD range 
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Table 1. Genome coverage (in Mb) by EDD and current broad peak callers 
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a Raw data from ref. (6). 












Table 2. Number of peaks detected by EDD and current broad peak callers 
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a Raw data from ref. (6). 
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Figure 3. Identification of LMNA occupancy sites by ChlP-seq. (a) 
IGV browser views of LMNA occupancy throughout chromosome 7 in 
AD04DF (both ChlP-seq replicates) and LDF. Tracks show genes, FPKM 
values from RNA-seq data (AD04DFs) and LMNA log ChlP/Input ra- 
tios, (b) ChlP-qPCR analysis of LMNA enrichment on indicated promoter 
and intergenic sites (mean ± SD from three experiments) in AD04DF. 
LAD information (+, within LAD; -, outside LAD) is from our subse- 
quent LAD discovery using EDD. ChlP-qPCR data are consistent with 
LAD identification by ChlP-seq. 



from 0.2 to > 10 Mb, with a median size of ~1 Mb, which 
is significantly larger than the median size of peaks iden- 
tified by the other algorithms (median range of < 10 Kb 
to 100 Kb; P < 10~ 50 , Wilcoxon rank-sum test; Figure 4c). 
EDD peaks are all enriched in LMNA (Figure 4a; Sup- 
plementary Figure SI a), as expected from EDD's purpose. 
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Figure 4. Quantitative analysis of LMNA ChlP-seq data using EDD. (a) 
IGV browser views of domains enriched in LMNA discovered by EDD 
(EDD peaks) on chromosome 8 in AD04DF (2 replicates) and LDF. 
LMNA profiles are shown as log ChlP/Input ratios. Genes and FPKM 
values (AD04DF) are also shown, (b) LMNA peaks (LADs) identified by 
RSEG, BroadPeak, SICER and PeakRanger in AD04DF (both ChlP-seq 
replicates). DamlD-derived LMNA LADs in HT1080 cells (11) are also 
shown (bottom track), (c) Median peak length detected by EDD, Broad- 
Peak, SICER and RSEG in AD04DF_repl and rep2, and in LDF (*P < 
10 -75 relative to BroadPeak, SICER and RSEG; Wilcoxon rank-sum test; 
see Supplementary Table S3 for W- and P- values). 
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(ii) BroadPeak discovers a high number of peaks (Figure 
4b; Table 2), some strikingly covering regions of 'negative' 
enrichment (Figure 4b; Supplementary Figure SI a). Thus, 
BroadPeak does not appear to be suited for the discovery 
of LADs. (iii) RSEG identifies eight times more peaks than 
EDD (Table 2); however as for LMNB1, RSEG also delin- 
eates 'negatively enriched' regions (Figure 4b; Supplemen- 
tary Figure SI a). Thus, RSEG appears to be too lenient for 
the detection of LMNA LADs. (iv) Lastly, genome cover- 
age by SICER peaks is extremely low (Table 1), and SICER 
identifies narrow (and very few) peaks rather than actual 
domains (Figure 4b and c). This shows, as with the LMNB1 
data set, its unsuitability to accurately identify LADs. 

We demonstrate therefore that, in contrast to the avail- 
able broad peak callers tested, EDD is able to discover do- 
mains enriched in LMNA from ChlP-seq data sets. Our 
data are also notably the first, to our knowledge, to iden- 
tify of LMNA LADs from ChlP-seq data. 

LMNA LADs discovered by EDD are gene-poor and overall 
transcriptionally inactive 

Considering the overall gene-poor and lowly expressed state 
of LADs identified by DamID in earlier studies (2), we ex- 
amined the gene density and expression level of genes within 
peaks discovered by the different algorithms. From browser 
views, we note that EDD detects LMNA peaks mainly in 
gene-poor regions, identified by the absence of RNA-seq 
reads (no FPKM counts, Figure 4a). In fact, we calculated 
only 1.6 genes per Mb of EDD peak, while BroadPeak, 
RSEG and SICER peaks show higher gene density (P < 
10~ 3 ; Wilcoxon rank-sum test; Figure 5a; Supplementary 
Figure Sib). Thus, we conclude that LADs discovered by 
EDD are gene-poor. 

To qualify the relationship between LMNA detection 
and gene expression, we generated heat maps of LMNA 
level as a function of gene expression level, independently of 
LAD identification. The data show that LMNA level nega- 
tively correlates with gene expression (LDF, r 2 = 0.42, Fig- 
ure 5b; AD04DF, r 2 = 0.22; Supplementary Figure S2a). 
Further, repressed genes (FPKM = 0) show the highest level 
of LMNA (P < 10 -50 ; Wilcoxon rank-sum test) compared 
to weakly expressed (FPKM = 0-1) or highly expressed 
(FPKM > 1) genes (Figure 5c; Supplementary Figure S2b). 
Thus, LADs detected by EDD are associated with no or low 
gene expression, in agreement with the concept of LADs 
previously established by DamID (9). 

To specifically assess how EDD LADs relate to gene ex- 
pression, we determined the proportion of genes that are ex- 
pressed among all genes found within EDD, RSEG, SICER 
and BroadPeak peaks. We show from our RNA-seq data 
that the proportion of expressed genes (FPKM > 0) within 
EDD LADs (~15% 5 from a mean of 1.6 genes per Mb) is 
lower than that within RSEG, SICER or BroadPeak peaks 
(Figure 5d). The latter encompass 45% to 63% of expressed 
genes, which is similar to the proportion of all expressed 
genes in the genome (Figure 5d). These results indicate that 
LADs discovered by EDD are entirely consistent with the 
properties of LADs previously identified by a different assay 
(DamID), using different analysis methods and in different 
cell types (9-12). 
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Figure 5. LMNA LADs identified by EDD are gene-poor and overall tran- 
scriptionally inactive, (a) Number of protein-coding genes per megabase 
covered by indicated peak callers in AD04DF (*P < 0.001 , Wilcoxon rank- 
sum test; W = 0 for each comparison). Results for the other data sets are 
shown in Supplementary Figure Sib. (b) Heat map of LMNA enrichment 
on protein-coding genes of > 1 Kb as a function of gene expression level 
(LDF). (c) Median LMNA enrichment on protein-coding genes of > 1 
Kb that are repressed (FPKM = 0; 4789 genes), weakly expressed (FPKM 
= 0-1; 2179 genes) and highly expressed (FPKM > 1; 2097 genes). *P < 
10~ 50 relative to FPKM = 0-1 or FPKM >1; Wilcoxon rank-sum test: 
see Supplementary Table S3 for W- and P-values. Data for AD04DF are 
shown in Supplementary Figure S2a and b and Table S3, (d) Percentage 
of expressed protein-coding genes (FPKM > 0) in peaks discovered by 
indicated peak callers and among all protein-coding genes; *P = 10 -321 , 
Fisher's exact test relative to all genes. 



We conclude that EDD's conception fulfills its require- 
ment of reproducible discovery of broad LADs from ChlP- 
seq data. EDD is globally more robust than the other peak 
callers tested against spatially restricted variations in en- 
richment level. Our data are also the first to report the 
discovery of LMNA LADs by ChlP-seq. Discovery and 
analysis of genomic domains interacting with lamins using 
ChIP and EDD will not only expand our understanding 
of nuclear envelope-genome interactions, but also enable 
high-resolution mapping of putative variations in lamin- 
genome interactions during development and in the context 
of lamin-linked diseases (3,4). 

DISCUSSION 

We present a new genomic domain caller, EDD, for the dis- 
covery of broad genomic enrichment areas from ChlP-seq 
data, against reference input sequence data. The main ad- 
vantages of EDD over other broad peak callers are its sen- 
sitivity to the size of domains rather than the strength of 
enrichment at a particular site, and its robustness against 
local variations. Thus, EDD caters a niche that enables 
quantitative analysis of ChlP-seq data, for example nu- 
clear envelope- and chromatin-associated proteins such as 
lamins, and other widely distributed chromatin-bound pro- 
teins. In addition, EDD is uniquely performant with data 
showing low-level enrichment over wide genomic regions. 

Beside LADs, recent work has identified other large chro- 
matin domains potentially amenable to mapping with EDD 
(36). For instance, the nuclear envelope protein LAP2a, 
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a lamin- and chromatin-associated protein, displays wide 
nuclear distribution (37). HMGN5, a histone-like protein, 
also shows wide genomic distribution and a preference 
for heterochromatin (38). Regions enriched in the methyl- 
CpG-binding protein MeCP2, which is spread over methy- 
lated DNA but shows enriched regions (39), are also good 
candidates for mapping using EDD. Additional large ge- 
nomic domains include (i) domains of hPTMs such as 
H3R2mel, H3K9mel, H3K9me3, H3K79mel, H3K79me3 
or H2BK5mel (18,40); (ii) wide H3K4me3 or H3K27me3 
domains emerging during senescence in culture (6,7,41); (hi) 
domains occupied by histone variants (42,43); (iv) large 
organized chromatin lysine modifications or LOCKs (44) 
shown to overlap with (v) DNA hypomethylated blocks in 
cancer cells (45). EDD enables the identification of chro- 
matin domains with robustness against local variations, and 
may prove valuable to detect large-scale epigenetic changes, 
some of which are predictive of cancer (46). 

Elements still remain under consideration for improve- 
ment in EDD's performance. EDD is designed to detect 
megabase-size domains; therefore EDD will miss narrow 
(1-10 Kb) regions of enrichment if the adjacent regions 
are not enriched. Similarly, narrow depleted regions within 
highly enriched megabase domains might be included in a 
'peak' (domain) in cases where it would be preferable to sub- 
divide the domain. Deeper sequencing and stronger signal- 
to-noise ratios are two possible ways to improve EDD's sen- 
sitivity, as this allows for a smaller bin size. 

SUPPLEMENTARY DATA 

Supplementary Data are available at NAR Online. 
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